Structures, functions and adaptations of the human LINE-1 ORF2 protein

The LINE-1 (L1) retrotransposon is an ancient genetic parasite that has written around one-third of the human genome through a ‘copy and paste’ mechanism catalysed by its multifunctional enzyme, open reading frame 2 protein (ORF2p)1. ORF2p reverse transcriptase (RT) and endonuclease activities have been implicated in the pathophysiology of cancer2,3, autoimmunity4,5 and ageing6,7, making ORF2p a potential therapeutic target. However, a lack of structural and mechanistic knowledge has hampered efforts to rationally exploit it. We report structures of the human ORF2p ‘core’ (residues 238–1061, including the RT domain) by X-ray crystallography and cryo-electron microscopy in several conformational states. Our analyses identified two previously undescribed folded domains, extensive contacts to RNA templates and associated adaptations that contribute to unique aspects of the L1 replication cycle. Computed integrative structural models of full-length ORF2p show a dynamic closed-ring conformation that appears to open during retrotransposition. We characterize ORF2p RT inhibition and reveal its underlying structural basis. Imaging and biochemistry show that non-canonical cytosolic ORF2p RT activity can produce RNA:DNA hybrids, activating innate immune signalling through cGAS/STING and resulting in interferon production6–8. In contrast to retroviral RTs, L1 RT is efficiently primed by short RNAs and hairpins, which probably explains cytosolic priming. Other biochemical activities including processivity, DNA-directed polymerization, non-templated base addition and template switching together allow us to propose a revised L1 insertion model. Finally, our evolutionary analysis demonstrates structural conservation between ORF2p and other RNA- and DNA-dependent polymerases. We therefore provide key mechanistic insights into L1 polymerization and insertion, shed light on the evolutionary history of L1 and enable rational drug development targeting L1.


Supplementary Background and Discussion
A large fraction of eukaryotic genomes consists of mobile elements: sequences that either encode protein machinery to mediate their propagation or co-opt other mobile element proteins to copy themselves. DNA 'cut and paste' transposons, like the maize elements discovered by Barbara McClintock 78, are no longer active in primates. Instead, recent primate evolution is dominated by RNA 'copy and paste' retrotransposons, in which RNA intermediates are integrated into the genome by encoded reverse transcriptase (RT) activity 79. These are divided into two classes: (1) long-terminal repeat (LTR) retrotransposons, also called endogenous retroviruses (ERVs), similar to HIV-1 but no longer thought to be active in humans, and (2) active Long INterspersed Element-1 (LINE-1, L1) non-LTR retrotransposons [80][81][82]. Previously considered 'junk DNA', L1 is the only active protein-coding human transposon and is an important endogenous mutagen 82.
L1s are conserved to plants, and thus L1s and their hosts have been co-evolving for 1-2 billion years 102 in an arms race: the transposon attempts to copy itself in a process called retrotransposition (Fig. 1a), while the host defends against this mutagenic process. Multi-layered host defenses recognize the L1 DNA and RNA sequences, proteins, and retrotransposition intermediates, notably including p53 103, which may have evolved to suppress mobile elements 83,92,[103][104][105][106][107][108].
Biochemically, non-templated addition is also seen in retroviral RTs 138, and the 5' RNA cap may facilitate this activity as well as base pairing to facilitate template switching or jumping 139. In R2 these activities are partially understood mechanistically and structurally [140][141][142]. These activities are likely involved in the transition from first to second strand synthesis (Discussion). The region in R2 equivalent to the ORF2p tower lock was previously shown to contact RNA 143, although R2 does not have a tower and the baseplate does not have a PCNA-binding PIP box. PCNA recruits RNase H2 for efficient L1 retrotransposition 144; RNase H2 is mutated in the Mendelian interferonopathy Aicardi-Goutières Syndrome 144, and these patients respond clinically to RT inhibitors 145.

Protein expression and purification
ORF2p core (residues 238-1061, tower-fingers-palm-thumb-wrist) was expressed in E. coli as an N-terminal His6-MBP fusion with a 3C protease cleavage site (pAMS823) as previously reported 132 with modification. Cells were lysed in a microfluidizer (Microfluidics) in 500 mM NaCl, 10% glycerol, 1 mM TCEP, 25 mM imidazole, and 50 mM HEPES pH 8.0. The protein was purified by Ni-NTA and heparin affinity, the tag was cleaved using 3C protease, the protease was removed using heparin affinity, and the sample was polished by size exclusion on a Superdex 200 column (Cytiva) in SEC buffer (500 mM NaCl, 5% glycerol, 2 mM MgCl2, 0.5 mM TCEP, and 20 mM HEPES pH 8.0), with monodisperse fractions corresponding to the theoretical mobility of a monomer at 97 kDa. Mutant and subsequent WT ORF2p core proteins were purified similarly but with C-terminal His8 and lacking the N-terminal MBP. For crystallography, ORF2p-His8 core was purified as above but the final size exclusion polishing step used low-salt SEC buffer (150 mM NaCl instead of 500 mM) and the pooled fractions were concentrated to 5-6 mg/mL, aliquoted and flash frozen in liquid nitrogen. Full-length ORF2p (1-1275) using a codon-optimized ORFeus-Hs sequence 148 and a C-terminal 3C-3xFlag tag 92 was cloned into a customized insect vector pDARMO-PolH2.1 (pMT692) 149, expressed in SF9 insect cells using the MultiBac EMBacY system 150 (Geneva Biotech), purified by Flag and heparin affinity, and polished by size exclusion on a Superdex 200 column (Cytiva) in SEC buffer, with monodisperse fractions corresponding to the theoretical mobility of a monomer at ~150 kDa used for further structural experiments. For single nucleotide gel-based assays, HIV and HERV-K RTs were expressed and purified from SF9 insect cells using the MultiBac system, as previously reported 151; full-length ORF2p with a C-terminal His8 tag was expressed and purified analogously, as a fusion polyprotein containing N-terminal HERV-K and TEV proteases followed by a TEV cleavage site (ENLYFQG) to facilitate post-translational processing, which results in a single glycine residue at the N terminus.

Crystallization and structure determination of the ORF2p-8His core
Chain-terminated hybrid duplex was prepared by incubating RNA-template and DNA-primer oligos at 95 °C for 3 min and cooling to 4 °C over 1 hour (oligos supplied by IDT: DNA-5'-GCGCTTTC[ddC]-3' / RNA-5'-UUAGGAAAGCGC-3'). Aliquots of ORF2p-His8 core were thawed, allowed to equilibrate to room temperature, diluted to 3 mg/mL with 50 mM NaCl, and mixed with 2 mM MgCl2, 2 mM dTTP and a 1.3:1 molar ratio of hybrid duplex. The resulting complex was incubated at room temperature for 30 minutes and used to set up a range of commercial sparse matrix crystallization screens. The initial hit was obtained in the Proplex screen (Molecular Dimensions), condition D7 (0.1 M sodium citrate pH 5.5 and 15% PEG6000). These crystals were small, soft and difficult to handle, diffracted only to ~3.7 Å resolution, and gave highly anisotropic data. Sequential grid-screen optimizations were conducted to optimize pH, PEG molecular weight, PEG concentration and protein:well solution mixing ratio. Different combinations of organic solvents and salts were also extensively screened both as crystallization additives and in combination with additional PEG as post-growth order enhancement systems. The final crystals used to generate the data presented here were grown from 18% PEG8000, 0.1 M sodium citrate pH 5.6, 0.2 M NaCl, 10% DMSO and 5% 1,4-dioxane. For data collection, crystallization drops were layered with stabilizing solution (27.5% PEG8000, 20% DMSO, 0.05 M sodium citrate pH 5.6) and incubated for 1 hour prior to harvesting by immersion in liquid nitrogen. Optimized crystals diffracted to ~2.1 Å but still exhibited up to 1.0 Å difference in resolution between the best and worst reciprocal lattice directions. Merging multiple datasets was found to greatly reduce this axial resolution gap. Final data, derived from merging six crystals, have <0.4 Å variation between best and worst resolution limits. All data were collected at Diamond Synchrotron, Beamline I03 (λ = 0.976 Å), using a Dectris Eiger2 XE 16M detector. Datasets were indexed and integrated with DIALS, scaled and merged with Aimless and phased by molecular replacement with Phaser using AlphaFold model 152 AF-000370-F1 truncated to residues 238-1061 and with the tower domain removed from the search model. The structural model was rebuilt using Coot 153 and refined with Buster 154. The final structure has Ramachandran angles favored/allowed/outlier (%) of 96.39/3.61/0.00 and further refinement statistics are found in Extended Data Table 1. Contact analysis between ORF2p and ligands was performed with the PLIP server 155 and manually checked with cutoffs of 2.5-3.3 Å for polar interactions and 3.7 Å for van der Waals interactions; contacts identified for dTTP include both the incoming nucleotide and bound magnesium 156.

ORF2p reverse transcriptase activity assays
Microwell assays were performed using the reverse transcriptase assay, colorimetric (Roche) according to the manufacturer's instructions, with the supplied poly(A) template and oligo(dT)15 primer. ORF2p fractions were diluted for assay in lysis/binding buffer (50 mM Tris, 80 mM potassium chloride, 2.5 mM DTT, 0.75 mM EDTA, and 0.5% Triton X-100; pH 7.8) and incorporation of digoxigenin- and biotin-labeled dUTP into DNA was measured by absorbance at 405 nm as compared to a 490 nm reference. Gel-based RT activity assays consisted of pre-incubating RTs with annealed DNA/RNA, DNA/DNA, or RNA/RNA 5'-end-radiolabeled or 5'-end-Cy5- or FAM-labeled template:primer duplex or hybrid duplex and, where indicated, inhibitor in the presence of 0.1-1 μM dNTP or NTP mixture, 0.25 mM EDTA, 50 mM NaCl, and 25 mM Tris (pH 8) for 10 min at 37 °C. Labeled nucleic acids were purchased from Dharmacon or IDT. Unless otherwise indicated, 15 μL reactions were initiated by the addition of 1.3 mM MgCl2, incubated for 10 min at 37 °C, and then stopped by the addition of 15 μL of formamide/EDTA (25 mM) mixture and incubated at 95 °C for 10 min. Reaction samples (3 μL) were subjected to denaturing 8 M urea 20% PAGE to resolve products, followed by signal quantification (ImageQuant 5.2, GE Healthcare Bio-Sciences) through phosphorimaging (Amersham Typhoon 5, Cytiva). Scanned gel images are cropped and corrected for distortion artifacts with contrast uniformly increased to facilitate the visualization of minor products; original images are provided in an Extended Data file.
For HTRF RT assays 157, 25 nM ORF2p core and 12.5 nM template:primer were incubated at 25 °C for 60 minutes with 10 nM fluorescein-12-dUTP (Thermo), 1 µM each of dATP, dCTP and dGTP, and test compound in a 15 µL reaction with buffer containing 50 mM Tris-HCl, 50 mM KCl, 10 mM MgCl2, 10 mM DTT, pH 8.1, and 1% final DMSO in 384-well format in duplicate. Detection reagent (5 µL; streptavidin-terbium cryptate, 20 mM EDTA in PPI buffer, Cisbio Bioassay) was added, and the mixture was incubated at 25 °C for 30 minutes. Fluorescence was then read at ex/em = 337/485 nm and ex/em = 337/520 nm on an Envision 2104 plate reader (Perkin Elmer). The fluorescence ratio at 520/485 nm was used to calculate inhibition, with the DMSO sample as 0% inhibition and no enzyme as 100% inhibition. IC50 was calculated with a 4-parameter non-linear regression equation. Template:primer mixtures were pre-annealed for 60 min at room temperature and consisted of poly(rA45) and biotin-oligo(dT)16 (Generay Biotech) for NNRTIs. For NRTIs, the following template:primer pair was instead used:
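
The HTRF readout described above (520/485 nm ratio, DMSO as 0% inhibition, no enzyme as 100% inhibition, four-parameter dose response) can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function names are hypothetical, the four-parameter logistic is written in one common parameterization, and the regression step itself (for example with a curve-fitting library) is left out.

```python
def htrf_ratio(f520, f485):
    """Fluorescence ratio at 520/485 nm (both read with 337 nm excitation)."""
    return f520 / f485


def percent_inhibition(ratio, ratio_dmso, ratio_no_enzyme):
    """Scale a ratio so that the DMSO control is 0% and no enzyme is 100%."""
    return 100.0 * (ratio_dmso - ratio) / (ratio_dmso - ratio_no_enzyme)


def four_pl(conc, bottom, top, ic50, hill):
    """Four-parameter logistic model of inhibition versus concentration.

    Assumed parameterization (hypothetical, for illustration): inhibition rises
    from `bottom` to `top` and equals their midpoint at conc == ic50.
    Requires conc > 0.
    """
    return bottom + (top - bottom) / (1.0 + (ic50 / conc) ** hill)
```

In practice the four parameters would be estimated by non-linear least squares on the duplicate wells, and the fitted `ic50` reported.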

Interferon reporter assay in THP1 cells
The type I interferon response was evaluated using THP1-Dual and THP1-Dual KO-TREX1 cells (InvivoGen), which secrete a Lucia luciferase reporter under control of an interferon-responsive promoter. THP1-Dual KO-TREX1 cells were generated by stable biallelic knock-out of the TREX1 gene. Cells were treated with a dose titration of test compound in the presence of 1 µM 5-aza-2ʹ-deoxycytidine (decitabine, Sigma, #189825), which de-represses LINE-1 121. Type I interferon and cell viability were assessed after five days of treatment. QUANTI-LUC solution containing stabilizer was added to the cell supernatant and luminescence was measured on a plate reader, and cell viability was assessed using CellTiter-Glo (Promega, #G9683) according to the manufacturer's instructions.

LINE-1 dual luciferase retrotransposition assay
To assess the potency of compounds in inhibiting LINE-1 retrotransposition, a stable clonal dual luciferase L1 reporter cell line was generated as described 130,133 in the HeLa Tet-On 3G cell line (Takara, ). SB100x 161 was used to integrate pRT006.2, a vector similar to pYX056 130, which contains a bi-directional Tet-On promoter expressing both control Renilla luciferase and LINE-1 ORFeus-Hs Firefly luciferase antisense intron (AI) reporter 79,162. The single-cell clone with the highest doxycycline-induced luciferase signal relative to baseline was selected. Cells were mixed with compounds and induced for reporter expression with 500 ng/mL doxycycline (Sigma, #D9891) for 72 hours. Luminescence was measured using the Dual-Glo Luciferase Assay System (Promega, #E2940) following the manufacturer's instructions, and the ratio of Firefly to Renilla luciferase activity was used to measure retrotransposition.
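
The Firefly/Renilla readout above amounts to a simple normalization, sketched below. The function names are hypothetical, and expressing each well as a percentage of a vehicle-treated control is an assumption added for illustration; the text only states that the Firefly to Renilla ratio was used.

```python
def retrotransposition_signal(firefly, renilla):
    """Firefly (L1 reporter) luminescence normalized to Renilla (control)."""
    if renilla <= 0:
        raise ValueError("Renilla signal must be positive")
    return firefly / renilla


def percent_of_vehicle(ratio, vehicle_ratio):
    """Assumed normalization: ratio as a percentage of the vehicle control."""
    return 100.0 * ratio / vehicle_ratio
```

Dividing by the Renilla signal corrects for well-to-well differences in cell number and promoter induction, since both reporters are driven from the same bi-directional promoter.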

Telomerase activity assay
The human telomerase assay was performed with telomerase in MCF-7 cell lysates using the Telo TAGGG Telomerase PCR ELISAplus kit (Roche). Test compounds (NTPs) were serially diluted in water, mixed with 0.2 µg of MCF-7 lysate, and pre-incubated at room temperature for 15 minutes. Then the reaction was carried out for 30 minutes, amplified using PCR, and visualized colorimetrically per the manufacturer's instructions.

Differential scanning fluorimetry
Lyophilized oligos for differential scanning fluorimetry (DSF) were reconstituted in RNase-free TE to 500 μM. To form a hybrid, an equimolar ratio of DNA primer (oligos supplied by IDT, 5'-GCGAAAAATTTCG[ddC]-3') and RNA template (5'-GGAGCGAAAUUUUUCGC-3') was mixed and diluted in DSF buffer (20 mM HEPES-KOH pH 7.6, 100 mM sodium chloride, 1 mM DTT, 2 mM magnesium acetate) to a final concentration of 25 µM. Oligos were then annealed by heating to 95 °C and cooling in a step gradient of 10 °C every five minutes until 5 °C in a thermocycler. Purified L1 ORF2p core protein was diluted in DSF buffer to a final concentration of 1 μM in the presence or absence of 5 μM RNA or DNA/RNA hybrid. Nineteen microliters per well of buffer only, protein or protein-nucleic acid mixture were transferred to a 384-well plate to which 1 μL of fivefold SYPRO Orange (Thermo Fisher S6650) was added. Fluorescence measurements were obtained using a TAQMAN 7900 QPCR (Life Technologies) machine monitoring the fluorescent signal at 570 nm over a temperature ramp from 20 °C to 95 °C. Melting temperatures (Tm) were calculated using DSF World 163 using sigmoid fitting and the normalized curves were plotted using Prism (GraphPad).
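
As an illustration of what sigmoid fitting of a melt curve involves, the sketch below fits a generic Boltzmann sigmoid by brute-force grid search. This is not the DSF World implementation, whose models are not reproduced here: the function names, the slope grid and the 0.1 °C step are assumptions, and the sketch only handles the rising part of a curve (real SYPRO Orange traces usually decay after the peak and would need trimming first).

```python
import math


def boltzmann(t, f_min, f_max, tm, slope):
    """Boltzmann sigmoid: signal rises from f_min to f_max with midpoint tm."""
    return f_min + (f_max - f_min) / (1.0 + math.exp((tm - t) / slope))


def fit_tm(temps, signal, slopes=(0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0), step=0.1):
    """Grid-search the (tm, slope) pair that minimizes squared error.

    f_min and f_max are taken from the data rather than fitted, which is a
    simplification made to keep the sketch dependency-free.
    """
    f_min, f_max = min(signal), max(signal)
    t_lo, t_hi = min(temps), max(temps)
    n_steps = int(round((t_hi - t_lo) / step))
    best = None
    for i in range(n_steps + 1):
        tm = t_lo + i * step
        for slope in slopes:
            sse = sum(
                (boltzmann(t, f_min, f_max, tm, slope) - s) ** 2
                for t, s in zip(temps, signal)
            )
            if best is None or sse < best[0]:
                best = (sse, tm, slope)
    return best[1], best[2]
```

A shift in the fitted Tm between protein alone and protein with RNA or hybrid is the quantity of interest in this kind of experiment.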

Crosslinking mass spectrometry
DNA-RNA hybrid was produced by resuspending the individual DNA and RNA oligos (sequences as in the cryo-EM duplex) in 500 mM NaCl to a final concentration of 500 μM. These solutions were mixed 1:1 (final concentration 250 μM) and annealed in a thermocycler as follows: 5 min at 95 °C, 45 min ramp to 25 °C and then 10 min ramp to 4 °C.
Purified full-length ORF2p and ORF2p core in SEC buffer were crosslinked using BS3 (bis(sulfosuccinimidyl)suberate; ThermoFisher Scientific, #21580), with and without the addition of DNA:RNA hybrid, using a final protein concentration of 1 μg/μL in 500 mM NaCl (and 2.7 mM HEPES pH 8, 0.7% glycerol (v/v), 0.07 mM TCEP, 0.27 mM MgCl2). For samples containing DNA:RNA, the hybrid was added at a 1.5:1 molar ratio to ORF2p, with 2 mM dTTP. The mixtures were incubated for 1 hour on ice prior to initiating crosslinking. BS3 solutions were prepared at different concentrations and added to the reaction mixtures accordingly, which were agitated in a thermal mixer at 750 RPM, 23 °C for 3 min. Crosslinking reactions were quenched by adding Tris to a final concentration of 100 mM from a stock solution of 500 mM NaCl, 500 mM Tris pH 8.0, and incubated at room temperature for 15 minutes.
For tryptic digestion and sample cleanup prior to LC-MS/MS analysis, the quenched crosslinking reactions were first dried down using a centrifugal vacuum concentrator. The dried reaction products were resuspended in 25 μL of S-trap 'high recovery' solution (5% SDS, 8 M urea, 100 mM glycine pH 7.55), reduced (5 mM TCEP, 55 °C, 15 minutes), alkylated (20 mM MMTS at room temperature for 10 minutes) and Lys-C/trypsin (Promega, #V5071) digested on S-trap micro columns (Protifi) following the manufacturer's instructions. Eluted, digested peptides were dried using a centrifugal vacuum concentrator and resuspended in 25 μL of 0.1% (v/v) formic acid in water (MS grade, ThermoFisher Scientific).
Mass spectrometry of the digested reaction products was conducted on a Thermo Scientific Orbitrap Exploris 480. The mobile phase consisted of 0.1% (v/v) formic acid in water (A) and 0.1% (v/v) formic acid in acetonitrile (B). Samples were loaded using a Dionex Ultimate 3000 HPLC system onto a 75 μm x 50 cm Acclaim PepMap™ RSLC nanoViper column filled with 2 µm C18 particles (ThermoFisher Scientific, #164540) using a 60 min LC-MS method at a flow rate of 0.3 µL/min as follows: 3% B over 3 min; 3 to 50% B over 45 min; 50 to 80% B over 2 min; wash at 80% B over 5 min; 80 to 3% B over 2 min; and equilibration of the column with 3% B for 3 min (MS data were acquired over the entire program, including the wash). For precursor and fragment detection on the mass spectrometer, MS1 survey scans (m/z 375 to 1500) were performed at a resolution of 120,000 with a 300% normalized AGC target. Peptide precursors from charge states 2-6 were sampled for MS2 using DDA. For MS2 scans, HCD was used, and the fragments were analyzed in the Orbitrap with a collisional energy of 30%, resolution of 15,000, standard AGC target, and a maximum injection time of 50 ms.
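
The LC gradient above can be written as a small table of segments, which makes it easy to confirm that the steps add up to the stated 60 min and to look up the solvent composition at any time point. This is only a sketch: the names are hypothetical, and reading "3% B over 3 min" as an isocratic hold and each ramp as linear is an assumption.

```python
# (duration in min, starting %B, ending %B), transcribed from the method text
GRADIENT = [
    (3, 3, 3),    # initial hold at 3% B (assumed isocratic)
    (45, 3, 50),  # separation ramp
    (2, 50, 80),  # ramp to wash
    (5, 80, 80),  # wash
    (2, 80, 3),   # return to starting conditions
    (3, 3, 3),    # re-equilibration
]


def total_minutes(segments):
    """Total programme length in minutes."""
    return sum(duration for duration, _, _ in segments)


def percent_b(t, segments=GRADIENT):
    """%B at time t (min), assuming linear change within each segment."""
    if t < 0:
        raise ValueError("time must be non-negative")
    elapsed = 0.0
    for duration, start, end in segments:
        if t <= elapsed + duration:
            fraction = (t - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")
```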
RAW data were searched using pLink 2.3.9 164, MaxLynx (MaxQuant 2.1.4.0) 165, and Proteome Discoverer 2.4 with the XlinkX plugin 166. Among the search parameters, a maximum of three missed cleavages was allowed, and thiomethylation of cysteines by MMTS was set as a static modification. The maximum false discovery rate was set to 1%. Crosslinks found in automated searches were manually validated by inspecting MS2 spectra signal-to-noise and percentage of b and y fragments detected (Supplementary Table 1). Concentrations of BS3 crosslinker were 10 and 30 µM for ORF2p core and 30 and 100 µM for full-length ORF2p. A raw list of crosslinks, initially identified with pLink, was filtered with the following conditions: (i) the crosslink had to be identified by at least one other engine (Proteome Discoverer or MaxLynx), and (ii) crosslinked residues had to be observed directly, or fragments had to cover more than 50% of the crosslinked peptide. Duplicate residue pairs (which sometimes corresponded to different peptides) were removed, and filtered crosslinks were then divided into 3 lists: (1) present only in the core, (2) present only in full-length, (3) present in both.
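
The consensus filtering and partitioning of crosslinks described above can be sketched with plain sets. The function names and the representation of a crosslink as a pair of residue numbers are assumptions for illustration; criterion (ii), which depends on fragment coverage in the MS2 spectra, is not modeled here.

```python
def normalize_pair(a, b):
    """Order-independent key for a crosslinked residue pair."""
    return tuple(sorted((a, b)))


def filter_crosslinks(plink_hits, other_engine_hits):
    """Keep pLink residue pairs confirmed by at least one other engine.

    plink_hits: iterable of (residue_a, residue_b) from pLink.
    other_engine_hits: list of iterables, one per additional search engine.
    Duplicate residue pairs are collapsed; input order is preserved.
    """
    confirmed = set()
    for hits in other_engine_hits:
        confirmed.update(normalize_pair(a, b) for a, b in hits)
    kept, seen = [], set()
    for a, b in plink_hits:
        pair = normalize_pair(a, b)
        if pair in confirmed and pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept


def partition(core_pairs, full_length_pairs):
    """Split filtered crosslinks into core-only, full-length-only and shared."""
    core, full = set(core_pairs), set(full_length_pairs)
    return {
        "core_only": sorted(core - full),
        "full_length_only": sorted(full - core),
        "both": sorted(core & full),
    }
```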

Cryo-EM sample preparation and data collection
Samples for cryo-TEM studies were prepared by mixing purified ORF2p with 1.5x molar excess of annealed heteroduplex (oligos supplied by IDT: DNA-5'-GCGAAAATTTCG[ddC]-3' / RNA-5'-GGAGCGAAAUUUUCGC-3') or single stranded poly(A)25 and diluted to a final concentration of 0.15 mg/mL with EM buffer (20 mM HEPES pH 7.6, 150 mM sodium chloride, 2 mM magnesium acetate, 2 mM DTT) and 2.5 mM dTTP. ORF2p core and mixed nucleic acids were incubated on ice for 15 minutes to allow for equilibration prior to preparation of grids. A combination of R1.2/1.3 Quantifoil 300 mesh and R0.6/1 200 mesh holey carbon grids were glow discharged for 60 seconds using a Pelco easiGlow glow discharger. Vitrified grids were prepared by applying 2 µL of ORF2p core with or without bound nucleic acid to grids, blotting manually for 2 seconds (200 mesh) or 3 seconds (300 mesh) from behind grids with Whatman 41 grade filter paper and plunging into liquid ethane using a Leica EM CPC manual plunger. Grids were prepared in batches and screened with a Talos Arctica at the Rockefeller University Evelyn Gruss Lipper Cryo-electron Microscopy Resource Center.
An initial dataset of 9442 micrographs of ORF2p core-template:primer was collected using a spherical aberration corrected 300 kV Titan Krios (Thermo Fisher Scientific) equipped with a GIF BioQuantum and K3 camera (Gatan). Micrographs were taken with SerialEM 167 at a nominal magnification of 105,000x in super-resolution mode at a nominal pixel size of 0.43 Å/pixel over a defocus range of −0.8 to −2.5 μm with a step size of 0.1 μm and using a 20 eV energy filter slit. Movies were recorded with a dose per frame of 1.08 e⁻/Å² in dose-fractionation mode with 50 subframes over a 2 second exposure to give a total electron flux of approximately 54 e⁻/Å². After processing these data (described in detail below) a slightly anisotropic reconstruction was obtained, with cryoEF 168 detecting a minor gap in Fourier space and calculating a tilt angle of 30 degrees to fill it in. A second dataset of 1828 micrographs using the same data collection parameters and 30-degree tilt was collected and combined with untilted data. A similar approach was taken for the single stranded oligo(A)25 sample, where an initial untilted dataset of 5815 micrographs and then a 30-degree tilted dataset of 6809 micrographs were collected. ORF2p core-oligo(A)25 data were collected using a 300 kV Titan Krios (Thermo Fisher Scientific) equipped with a GIF BioQuantum and K3 camera (Gatan). Micrographs were taken with Leginon 169 in counted mode at a nominal pixel size of 0.826 Å/pixel over a defocus range of −1.0 to −2.75 μm with a step size of 0.25 μm and using a 20 eV energy filter slit. 200 mesh grids were primarily used for tilted data collection due to larger mesh areas. Movies were recorded with a dose per frame of 1.16 e⁻/Å² in dose-fractionation mode with 48 subframes over a 2.2 second exposure to give a total electron flux of approximately 54 e⁻/Å². A single untilted dataset for apo ORF2p core was collected using a 300 kV Titan Krios (Thermo Fisher Scientific) equipped with a GIF BioQuantum and K3 camera (Gatan). Micrographs were taken with SerialEM 167 at a nominal magnification of 130,000x in super-resolution mode at a nominal pixel size of 0.325 Å/pixel over a defocus range of −1.0 to −2.8 μm with a step size of 0.2 μm and using a 20 eV energy filter slit. Movies were recorded with a dose per frame of 1.32 e⁻/Å² in dose-fractionation mode with 38 subframes over a 2 second exposure to give a total electron flux of approximately 51 e⁻/Å².

Single particle analysis of cryo-EM data
The untilted ORF2p core-template:primer dataset was initially processed independently as follows. Dose-fractionated movies were gain-normalized, motion-corrected and dose-weighted using MotionCor2 170 and then imported into cryoSPARC v.3.1.0 171 for downstream processing starting with contrast transfer function (CTF) correction with patch CTF estimation. Particles from a subset of 2000 micrographs were picked using the cryoSPARC blob picker and subjected to reference-free 2D classification. The consistent classes from 2D classification were used as templates for template-based picking on all micrographs, with picked particles subjected to reference-free 2D classification. Particles from self-consistent classes were selected and subjected to ab initio model generation and then three rounds of heterogeneous refinement. The highest quality reconstruction, comprising 255,612 particles, was subset and refined using non-uniform refinement 172, resulting in a reconstruction at 3.49 Å resolution. Fourier coverage appeared incomplete and cryoEF 168 was used to determine an optimal tilt angle for additional data collection.
The final datasets for the three samples were processed in a similar fashion. Movies were motion- and CTF-corrected as described above. 2D classes from the untilted ORF2p core-template:primer dataset were used to template pick each dataset and particles were subjected to 2D classification and ab initio model generation independently. Particles from tilted and untilted datasets were combined at this point for heterogeneous refinement. The particles from the highest quality reconstruction in each combined dataset were transferred to Relion v3.1 173 using pyem 174. Combined particle sets were extracted in Relion from micrographs that were CTF corrected with CTFFIND 4.1 175 and subjected to 3D classification with or without alignment. Selected classes were then processed using iterative rounds of 3D auto-refinement, Bayesian polishing and CTF refinement. Particle orientations and CTF parameters were imported back into cryoSPARC and a final reconstruction was generated using non-uniform refinement. Maps for ORF2p core-template:primer and -poly(A)25 were postprocessed with both global B factor sharpening and local sharpening with deepEMhancer 176, with both postprocessed maps and unfiltered half-maps deposited in EMDB. The apo ORF2p core map was low pass filtered using the Volume Utility in cryoSPARC. Data processing steps and map validation are presented in detail in Supplementary Figs. 1-2. The ORF2p crystal structure from this study was used as the starting model for model building and refinement using Coot 153 and Phenix 177, respectively. Structural models were generated for ORF2p core bound to RNA:DNA hybrid and ssRNA and summary statistics for maps and models are found in Extended Data Table 2.

Negative stain TEM of full-length ORF2p
Full-length ORF2p for negative stain TEM was prepared by adding 1.5x molar excess of RNA template:DNA primer hybrid or L376 RNA to full-length ORF2p at a final protein concentration of 0.10 mg/mL. After equilibration, 2 μL full-length ORF2p was applied to glow-discharged carbon-coated copper grids and stained with 1% uranyl acetate. Grids were imaged with an FEI Tecnai G2 Spirit BioTwin TEM with an AMT BioSprint 29 camera. Particles were picked and 2D classes generated using the SPHIRE software suite 178. Class averages were postprocessed in EMAN2 179 prior to being passed to IMP. L376 RNA was produced by run-off transcription using T7 RNA polymerase from pBS27 digested with BsaI, which produces a 376 nt RNA corresponding to the last 362 residues of L1RP (His1224 through the end of the 3' UTR) with a 14-nucleotide A tail.

Integrative structure modeling of the ORF2p
Integrative structure determination proceeded through the standard four stages [180][181][182]: (1) gathering data, (2) representing subunits and translating data into spatial restraints, (3) configurational sampling to produce an ensemble of structures that satisfies the restraints, and (4) analyzing and validating the ensemble structures and data. The data should be understood in a broad sense and can include results of other modeling experiments following the same four-step approach, forming a hierarchical structure. The integrative structure modeling protocol (i.e., stages 2, 3, and 4) (Supplementary Table 2) was scripted using the Python Modeling Interface (PMI) package, a library for modeling macromolecular complexes based on our open-source Integrative Modeling Platform (IMP) package 183, and executed in IMP 2.18.
For some analyses and visualization, we computed an atomic model from a coarse-grained integrative structure model by expanding the bead positions into the full-backbone structure 184, adding sidechains 185 and optimizing stereochemistry 186. Structural analyses were performed with GROMACS 187 built-in tools and Python scripts using the MDAnalysis v2.4.3 188 and ProDy v2.4 189 libraries. Particle radius was measured as the largest distance between the center of mass of an image and all non-zero pixels.
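
The particle radius definition above can be sketched in a few lines. The function name is hypothetical, and the sketch assumes an intensity-weighted center of mass over non-negative pixel values; the radius is returned in pixels and would need scaling by the pixel size.

```python
import math


def particle_radius(image):
    """Largest distance from the image center of mass to any non-zero pixel.

    image: 2D list (rows of pixel intensities, assumed non-negative).
    """
    total = 0.0
    sum_y = 0.0
    sum_x = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            total += value
            sum_y += value * y
            sum_x += value * x
    if total == 0:
        raise ValueError("image has no signal")
    center_y = sum_y / total
    center_x = sum_x / total
    return max(
        math.hypot(y - center_y, x - center_x)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if value != 0
    )
```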

Modeling of ddTTP, d4T, and AZT bound to L1 RT
The L1 RT crystal structure in complex with dTTP was prepared with the Protein Preparation Workflow in Maestro (Schrödinger Suite version 2023-1) using default parameters to fill in missing side chains, optimize hydrogen bond assignments, and minimize the structure (convergence to 0.3 Å RMSD for heavy atoms). The structures for ddTTP, d4T and AZT were built by modifying the dTTP structure present in the L1 RT crystal structure. AZT bound to ORF2p was compared to the structure of HIV-1 RT bound to AZT 190. The OPLS4 force field was customized for the ligands of interest using the Force Field Builder in Maestro with S-ANSI theory level (neutral structures) for geometry optimization. The newly built ligands were minimized in the context of the L1 RT structure using the dTTP crystallographic binding mode as a starting pose. Where clashes appeared, which was observed only for AZT, the protein residues around the clash were minimized to attempt to relax the structure.

Relative free energy of binding calculations
FEP+ (Schrödinger Suite version 2023-1) was used to construct a perturbation map including dTTP, ddTTP, d4T, and AZT in the context of the L1 RT crystal structure. The default perturbation protocol was used for the following pairs: dTTP/ddTTP, ddTTP/d4T, and d4T/dTTP, with 12 λ-windows and 10 ns of simulation per window. Perturbations including AZT (dTTP/AZT and ddTTP/AZT) used the charge-hopping protocol with 24 λ-windows and 10 ns of simulation per window. The previously customized OPLS4 force field was used to carry out the FEP+ calculation of relative binding free energy and values were reported as ∆∆G changes with respect to ddTTP.
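
To illustrate how pairwise perturbations in such a map relate to values reported against a single reference ligand, the sketch below sums edge ∆∆G values along a path from the reference and checks closure around a cycle. It is a generic illustration and not the FEP+ workflow, which applies its own cycle-closure analysis; the function names and the edge dictionary format are assumptions, and no values from the study are used.

```python
from collections import deque


def relative_to_reference(edges, reference):
    """Express each ligand's ddG relative to a reference ligand.

    edges: {(a, b): ddG of b relative to a} for each simulated perturbation.
    Values are accumulated along the first path found by breadth-first search.
    """
    graph = {}
    for (a, b), ddg in edges.items():
        graph.setdefault(a, []).append((b, ddg))
        graph.setdefault(b, []).append((a, -ddg))
    result = {reference: 0.0}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbor, ddg in graph.get(node, []):
            if neighbor not in result:
                result[neighbor] = result[node] + ddg
                queue.append(neighbor)
    return result


def cycle_closure(edges, cycle):
    """Sum of ddG around a closed cycle of ligands; ideally close to zero."""
    total = 0.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        if (a, b) in edges:
            total += edges[(a, b)]
        else:
            total -= edges[(b, a)]
    return total
```

For the map described above, the dTTP/ddTTP/d4T triangle is the cycle whose closure error would indicate how well converged the individual perturbations are.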

Evolutionary analysis
Our principal aim is to infer evolutionary similarity via protein structure, as has been done utilizing sequence.There is a fundamental issue with alignments, in that there is a trade-off between the coverage of an alignment and the quality of an alignment.We address this issue using information theory, building upon previous efforts 191 to derive distance metrics which can inform evolutionary similarity in groups of proteins.

Regions of conservation based on the sequence and structure of ORF2p.
We measured conservation against a curated set of 55 ORF2p sequences from vertebrates, including human ORF2p 192, to which we added LINE-1 sequences from 3 plants (corn, rice, and Arabidopsis thaliana; GenBank Y00086.1, AAG13524.1, and PIR: S65812, respectively). We computed a per-residue Shannon entropy of the aligned residues from both a multiple sequence alignment and a multiple structure alignment. The higher the entropy, the less conserved the residue. We conducted the multiple sequence alignment using Clustal Omega version 1.2.4 193 with default settings. We conducted the multiple structure alignment using the MUSTANG algorithm version 3.2.4 194 with default settings. The Shannon entropy was computed for each aligned ORF2p residue index in the multiple sequence/structure alignment as:
For correlation to the scanning tri-alanine mutagenesis assay data 195, we utilized the mean value of the %WT retrotransposition efficiency across replicates.
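
A per-column Shannon entropy of the kind described above is conventionally H = -Σ p_a log p_a, where p_a is the frequency of residue a in the column. The sketch below computes it for every alignment column in which a reference sequence has a residue. The function names, the base-2 logarithm, the exclusion of gap characters from the frequencies and the choice of the first row as the reference are all assumptions, since the exact form used in the study is not shown here.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def per_residue_entropy(alignment, reference_index=0, gap="-"):
    """Entropy at each column where the reference sequence is not a gap.

    alignment: list of equal-length aligned sequences (strings).
    Returns one value per reference residue, in sequence order.
    """
    reference = alignment[reference_index]
    return [
        column_entropy([seq[i] for seq in alignment], gap)
        for i, residue in enumerate(reference)
        if residue != gap
    ]
```

A fully conserved column gives zero entropy, and a column with all 20 amino acids at equal frequency gives the maximum of log2(20), about 4.32 bits.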

Evolutionary distance from other proteins.
We manually curated a set of 50 experimental protein structures which contained RTs, RdRps, a DdRp, a dual DdRp/RdRp, and a number of "controls" which should have little resemblance to the other proteins. For RT and RT-like proteins, the polypeptide with polymerase activity is used; for other proteins, the entire biological assembly is used. The curated list is available in Supplementary Table 3. We utilized the MMLigner software version 1.0.2 191 to compute the alignments, capping the Maximum-Fragment Pair (MFP) library at 5000 MFPs, as we observed that for large structures, such as ORF2p (1275 amino acids in length), the default pruning was insufficient and additional pruning was required for significant alignments to be obtained. Additionally, we enforced that two proteins with no residue alignments should each contribute their null contributions.
The efficiency of an alignment can be determined via the compression for a given alignment:
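As a hedged sketch of the quantity named above: in the minimum message length framework on which MMLigner is built, compression is generally expressed as the difference between the message length needed to encode two structures independently (the null model) and the length needed to encode them jointly with their alignment. The symbols below are assumptions chosen for illustration and may differ from the notation of the original equation.

```latex
\[
\mathrm{compression}(A) \;=\;
\underbrace{I_{\mathrm{null}}(S) + I_{\mathrm{null}}(T)}_{\text{structures encoded independently}}
\;-\;
\underbrace{I(A, S, T)}_{\text{alignment and structures encoded jointly}}
\]
```

Here S and T denote the two structures, A the alignment between them, and I(·) a message length in bits. Under this reading, a pair of proteins with no aligned residues is charged only its two null terms, so its compression is zero, and a larger positive compression indicates a more informative alignment.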

Statistics and Reproducibility
All experiments were repeated at least two or three times with similar results. All gel-based experiments were repeated at least twice (Fig. 2d; Fig. 3a-e,g; Fig. 4b,e; Extended Data Figs. 3-5, 7; Supplementary Figs. 3-5, 8). Microscopy experiments were repeated on four independent days and each condition was repeated in each experiment over at least two independent coverslips. The purification in Fig. 1c

acceptor template, resulting in a product that is a concatemer of two templates (or more, with repeated events) 142,193 . Template jumping and switching are similar but differ in that template jumps are facilitated by short (1-3 nt) microhomology that may be created by NTA, whereas template switches are blunt 139-141,194 . This activity for ORF2p is confirmed by Sanger sequencing-like reactions, in which in vitro polymerase reactions were conducted on DNA:DNA template:primers for 1 min and then continued for 5 min in a 100-fold excess of chain-terminating dideoxynucleotides (ddATP, ddTTP, ddCTP, ddGTP, d4T) as indicated. Complete Sanger sequencing of previously observed high molecular weight products confirms that these represent bona fide template jumps. Expected incorporation positions for ddATP, ddTTP, ddCTP and ddGTP, and the subsequent terminations for the first template (bottom) and second template (top) after template jumping, are annotated and enlarged in the inset. ORF2p core was preincubated with a template:primer for 1 min at 37 °C with a 1 μM dNTP mixture supplemented with 100 μM ddNTP as labelled, in 25 mM Tris-HCl (pH 8) buffer, 50 mM NaCl and 0.25 mM EDTA. Addition of d4T-TP, which is incorporated similarly to ddTTP, confirms the specificity of incorporation. Scanned gel images are cropped and corrected for distortion artifacts, with contrast uniformly increased to facilitate the visualization of minor products. (*, Cy5 label). Original scans are provided in a Source Data File.
c, SiteMap analysis of the L1 RT (left) and TERT (right) active sites showing the hydrophilic (teal) and hydrophobic (yellow) environments of the active sites. d, Model of sofosbuvir bound to the L1 RT active site (left) and crystal structure of HCV RdRP bound to sofosbuvir (PDB: 4WTG, right). Note the clash between F605 in L1 and the 2'-F of the ligand. The equivalent position in HCV RdRP is D225, which provides sufficient space for the 2'-group. Additionally, N291 in HCV RdRP is within hydrogen-bonding distance of the 2'-group, whereas the equivalent residue in L1 RT is F668, which precludes hydrogen bond formation.
containing N-terminal HERV-K and TEV proteases followed by a TEV cleavage site, resulting in a single N-terminal glycine scar. b, Gels and template:primer system corresponding to the single nucleotide incorporation data in Fig. 3b. Asterisk (*), 32P-labeled 5'-end of the primer. c, Full-length ORF2p and ORF2p core are compared in single nucleotide incorporation and inhibition experiments with the indicated nucleoside triphosphates and 3TC triphosphate; 'dNTPx4' is a mix of all four standard dNTPs. Full-length ORF2p (purity insufficient to accurately determine concentration) produces similar reaction products and shows similar activity and inhibition to both partially purified (Heparin) and fully purified (after SEC) ORF2p core. d, Representative Coomassie-stained SDS-PAGE of BS3-crosslinked ORF2p core protein, following reaction with various concentrations of BS3 in the presence of an annealed RNA template:DNA primer duplex. While the electrophoretic mobility of crosslinked monomers may be challenging to predict, higher molecular weight species not present in the starting material (0 μM BS3) are likely enriched in intermolecular crosslinks (XLs) rather than the desired intramolecular XLs. Based on this criterion, the 10 and 30 μM BS3 products were analyzed by MS. e, 56 unique crosslinks from ORF2p core and full-length ORF2p mapped onto the AlphaFold2 model of ORF2p (used as a starting point for integrative modeling); 91% of experimental crosslinks are satisfied. Original scans are provided in a Source Data File.
distributions are significantly different (p = 10⁻²⁸), with the RNA-only particles smaller, as highlighted by the inset cumulative distribution function (CDF) plot. c, Validation of radius of projection in 2D class averages as a metric of particle radius (radius of gyration), as the relation between the radius of projection and the radius of gyration of a model is non-linear. For some specific orientations of particles (Min) the radius of projection can be small and almost independent of the radius of gyration; however, the average and maximum (Max) radii of projection show a strong linear correlation with the radius of gyration of a model (r = 0.82 (average radius) and r = 0.88 (maximum radius), p<10

to each other. Where available, both the crystal coordinates and those from AlphaFold2 were compared (ORF2-crystal-full is 238-1061; ORF2-full-AF is 1-1275, etc.). Proteins with perplexity < 10⁻²⁰ are shown; above this value, in most groups the "other protein classes", which may generally be viewed as 'decoys', start to score. Apart from hits to ORF2p itself, the EN, tower and wrist domains all have no significant hits in this set; the CTD has very weak similarity to TERT and the Group IIB intron from Thermosynechococcus vestitus. The ancestral palm subdomain has very low perplexity with many polymerases in the set and recapitulates many of the relationships seen with the full crystal structure: ORF2p palm is predicted to be most similar to the other non-LTR transposon, R2Bm, followed by Group II mobile introns, HCV and influenza RdRPs, and domesticated cellular RTs, including PRP8 and TERT, followed more distantly by retroviral RTs. Again, the inactive p51 conformations of HIV-1/2 RT are predicted to be much more distant from ORF2p than the active p66 conformations, which are identical in sequence up to a deletion. The fingers and then the thumb subdomains are each predicted to be less similar than the palm, and to smaller numbers of these proteins, but in roughly similar orders; interestingly, in the thumb, R2Bm is predicted to be slightly less similar to ORF2p than some of the evolutionarily more distant proteins such as TERT and Group II introns.
is representative of >15 experiments in four laboratories; the purification in Supplementary Fig. 9 is representative of 3 experiments. Negative stain experiments were performed at least twice with each bound nucleic acid species. For electrophoresis, original scans of cropped gels and blots are provided in a Source Data File.

cryoSPARC-derived reference-free 2D classification of ORF2p core with clear secondary structure visible in class averages. c, Summary of single particle analysis for reconstructions of ORF2p core in different nucleotide ligand states. From an initial untilted data set of ORF2p core bound to the template:primer hybrid, a 3.58 Å resolution reconstruction was obtained with clear density for the bound hybrid, though a Fourier gap was identified. To fill in Fourier gaps, additional datasets were collected at 30° tilt for ORF2p core bound to template:primer hybrid and ssRNA. Cryo-EM data were processed by motion correcting movies in MotionCor2, followed by import into cryoSPARC, where micrographs were CTF corrected. An initial set of 2D class averages from a subset of the data was used for template-based particle picking. Picked particles were sorted by 2D classification, and the tilted and untilted datasets were combined in RELION 3.1 for 3D classification. The most complete 3D classes were selected and refined with iterative rounds of 3D auto-refinement, CTF refinement and Bayesian polishing. Final maps were obtained by importing particles and refined CTF values into cryoSPARC for non-uniform refinement. Tilted data for apo ORF2p were not necessary because a larger range of views was obtained from untilted data.
of ORF2p bound to respective substrates at FSC threshold of 0.143 (dotted line). b, Orientation distribution plots for ORF2p core cryo-EM reconstructions show complete orientation coverage. c, Single particle reconstructions of ORF2p core colored by local resolution as calculated by MonoRes. For all maps, the palm and flanking fingers and thumb are the highest resolution portions of the reconstruction, with more distal elements (wrist or tower) being more flexible relative to palm and more poorly resolved.
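The 0.143 FSC criterion mentioned in this legend can be illustrated with a short sketch that finds where a Fourier shell correlation curve first drops below the threshold and reports the corresponding resolution. The function name `resolution_at_threshold`, the linear interpolation between shells and the example numbers are assumptions for illustration only; they are not values or procedures taken from the reconstructions described here.

```python
def resolution_at_threshold(spatial_freqs, fsc_values, threshold=0.143):
    """Resolution (in Å) at which the FSC curve first falls below `threshold`.

    spatial_freqs: shell frequencies in 1/Å, in increasing order.
    fsc_values:    FSC value for each shell.
    Returns None if the curve never crosses the threshold.
    """
    shells = list(zip(spatial_freqs, fsc_values))
    for (f0, c0), (f1, c1) in zip(shells, shells[1:]):
        if c0 >= threshold > c1:
            # Linear interpolation between the two shells that bracket the crossing.
            fraction = (c0 - threshold) / (c0 - c1)
            crossing_freq = f0 + fraction * (f1 - f0)
            return 1.0 / crossing_freq
    return None


# Hypothetical FSC curve, not data from this study.
example_freqs = [0.10, 0.20, 0.30]
example_fsc = [1.0, 0.243, 0.043]
print(resolution_at_threshold(example_freqs, example_fsc))
```

Processing packages usually report this value directly, so the sketch is only meant to make the meaning of the dotted threshold line concrete.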
⁻³⁸ for both, two-tailed Pearson correlation). d, Multi-dimensional scaling comparison of 2D classes from negative stain EM of ORF2p bound to a short RNA17:DNA14 hybrid or long (376 nt) L1 template RNA shows overlap in many classes from both, but also key differences. e, Hierarchical clustering of structures from RNA template- and RNA:DNA hybrid-bound class averages representing closed and open states. Raw 2D class averages, determined by k-means clustering, their inverted contour plots, superpositions with best-matching structure, contour plots of generated projections, and distribution of scores (lower is closer match) for all orientations of 101-best-matching models.